I: Linear Regression - Fainle's Blog

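I. Overview

In statistics, linear regression is a method of analysis that models the relationship between one or more independent variables and a dependent variable, using a linear regression equation fitted by least squares.

With a single independent variable it is called simple (univariate) linear regression.

With several independent variables it is called multiple linear regression.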
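As a sketch of the two cases in symbols (the $\theta$ notation is an assumption, borrowed from the `theta` variable in the code later in this post; $n$ is the number of independent variables):

$\hat{y} = \theta_0 + \theta_1 x$

$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

Here $\theta_0$ is the bias (intercept) and the remaining $\theta_j$ are the coefficients.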

II. Category

Linear regression is generally classed as supervised learning.

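III. The Math

The regression line is commonly fitted with gradient descent (a closed-form least-squares solution also exists), so that the line passes as evenly as possible through the scattered data points.

The loss functions below are the common choices; fitting means minimizing the chosen loss.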

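1. Mean Absolute Error (MAE)

$Error = \frac{1}{m}\sum_{i=1}^{m}|y_i-\hat{y}_i|$

Drawback: it tends to converge slowly or not at all. The magnitude of its gradient does not shrink as the error approaches zero, so a fixed step size can keep overshooting the minimum.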

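2. Mean Squared Error (MSE)

$Error = \frac{1}{2m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^2$

Drawback: it is sensitive to large outliers. Because errors are squared, a few large outliers pull the slope toward themselves, and the line then fits the rest of the data poorly (underfitting).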
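To make the difference between the two losses concrete, here is a minimal sketch that evaluates both on a tiny hand-made dataset, once without and once with an outlier. The function names `mae` and `mse` and the sample values are assumptions made up for illustration; `mse` uses the $\frac{1}{2m}$ convention from the formula above.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error: (1/m) * sum(|y - y_hat|)."""
    return np.mean(np.abs(y - y_hat))


def mse(y, y_hat):
    """Mean squared error with the 1/(2m) convention."""
    return np.sum((y - y_hat) ** 2) / (2 * len(y))


# Hypothetical predictions and targets, chosen only for illustration
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_outlier = np.array([1.0, 2.0, 3.0, 14.0])  # last target is an outlier

mae_clean, mse_clean = mae(y_true, y_pred), mse(y_true, y_pred)
mae_out, mse_out = mae(y_outlier, y_pred), mse(y_outlier, y_pred)

print("clean:  ", mae_clean, mse_clean)
print("outlier:", mae_out, mse_out)
```

With this data the single outlier raises MAE from 0.25 to 2.75 (11 times) but raises MSE from 0.0625 to 12.5625 (201 times), which is why an MSE fit is dragged toward outliers much more strongly.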

IV. Gradient Descent Example Code
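The code in this section follows the usual gradient descent recipe. As a sketch, assuming the squared-error loss from the previous section, with $X$ the data matrix (including a leading column of ones), $\theta$ the parameter vector and $\alpha$ the learning rate (symbols chosen here for illustration):

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^2, \quad \hat{y} = X\theta$

$\nabla_\theta J = \frac{1}{m}X^{T}(X\theta - y)$

$\theta \leftarrow \theta - \alpha\,\nabla_\theta J$

The code leaves out the $\frac{1}{m}$ factor in both the cost and the gradient, which only rescales the cost and the effective learning rate by a constant.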

import numpy as np
import matplotlib.pyplot as plt

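data = np.loadtxt('data.csv', delimiter=',')  # load the data (assumes no header row)
X = data[:, :-1]  # features
y = data[:, -1]  # target values


# Scatter-plot the points to get a rough look at the distribution
plt.scatter(X, y, marker='.')
plt.show()

# Prepare the training set: prepend a column of ones for the bias term
data = np.hstack((np.ones((data.shape[0], 1)), data))
X_train = data[:, :-1]
y_train = data[:, -1].reshape((-1, 1))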


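def hypothesis(X, theta):
    """
    Predict y for every row of X.
    """
    return np.dot(X, theta)


# function to compute gradient of error function w.r.t. theta
def gradient(X, y, theta):
    """
    Compute the gradient of the cost (summed over the rows of X, not averaged).
    """
    h = hypothesis(X, theta)
    grad = np.dot(X.transpose(), (h - y))
    return grad


def cost(X, y, theta):
    """
    Compute the value of the loss function (half the sum of squared errors).
    """
    h = hypothesis(X, theta)
    J = 1 / 2 * np.dot((h - y).transpose(), (h - y))
    return J[0, 0]  # J has shape (1, 1); return it as a scalar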


def gradient_descent(X, y, learning_rate=0.001, batch_size=25, n_iters=1000):
    """
    Mini-batch gradient descent.

    Runs n_iters updates, each on a random batch of batch_size rows.
    """
    history_cost = []
    theta = np.zeros((X.shape[1], 1))
    n_points = X.shape[0]

    for _ in range(n_iters):
        # Sample row indices (with replacement) for this batch
        batch = np.random.choice(n_points, batch_size)

        X_batch = X[batch, :]
        y_batch = y[batch]

        theta = theta - learning_rate * gradient(X_batch, y_batch, theta)
        history_cost.append(cost(X_batch, y_batch, theta))

    return theta, history_cost


theta, error_list = gradient_descent(X_train, y_train, batch_size=1000, n_iters=1000)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])

# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()

y_pred = hypothesis(X_train, theta)
# Plot the fitted line over the data points
plt.scatter(X, y, marker='.')
plt.plot(X_train[:, 1], y_pred, color='orange')
plt.show()
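One way to sanity-check a gradient descent result is to compare it with the closed-form least-squares solution. The sketch below does that on synthetic data, since `data.csv` is not included here. Everything in it is an assumption for illustration: the data is generated as $y = 4 + 3x$ plus noise, and it uses full-batch gradient descent with an averaged gradient (not the mini-batch, summed-gradient variant above) so that the result is deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for this sketch): y = 4 + 3x + small noise
n = 200
x = rng.uniform(0, 2, size=(n, 1))
y = 4 + 3 * x + rng.normal(0, 0.1, size=(n, 1))

# Prepend a column of ones so theta[0] acts as the bias
X = np.hstack((np.ones((n, 1)), x))

# Closed-form least-squares solution
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full-batch gradient descent on the 1/(2m) squared-error cost
theta_gd = np.zeros((2, 1))
learning_rate = 0.1
for _ in range(5000):
    grad = X.T @ (X @ theta_gd - y) / n  # averaged gradient
    theta_gd -= learning_rate * grad

print("least squares:   ", theta_ls.ravel())
print("gradient descent:", theta_gd.ravel())
```

Both methods minimize the same cost, so the two parameter vectors should agree closely, and both should land near the bias 4 and slope 3 used to generate the data.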

V. Linear Regression Example Code (sklearn)

import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LinearRegression

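# Load the data; header=None because np.loadtxt above reads the same file as purely numeric rows
data = pd.read_csv('data.csv', delimiter = ',', header = None)

X = data.iloc[:, [0]].to_numpy()  # feature as a 2-D array of shape (n, 1)
y = data.iloc[:, 1].to_numpy()  # target as a 1-D array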

lr_model = LinearRegression()
reg = lr_model.fit(X, y)
y_pred = reg.predict(X)

plt.scatter(X, y, marker='.')
plt.plot(X, y_pred, color='orange')
plt.show()

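VI. Multiple Linear Regression

Multiple linear regression works essentially the same way as the single-variable case, but with more than two features the fit can no longer be drawn directly on a plot.

Multiple linear regression example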

from sklearn.linear_model import LinearRegression
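# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this example needs scikit-learn < 1.2.
from sklearn.datasets import load_boston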

boston_data = load_boston()
x = boston_data['data']
y = boston_data['target']

model = LinearRegression()

reg = model.fit(x, y)

sample_house = [[2.29690000e-01, 0.00000000e+00, 1.05900000e+01, 0.00000000e+00, 4.89000000e-01,
                6.32600000e+00, 5.25000000e+01, 4.35490000e+00, 4.00000000e+00, 2.77000000e+02,
                1.86000000e+01, 3.94870000e+02, 1.09700000e+01]]

print(reg.predict(sample_house))  # predicted target for the sample house
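Because `load_boston` is no longer available from scikit-learn 1.2 onward, here is a minimal sketch of the same workflow on a different bundled dataset. Swapping in `load_diabetes` is an assumption made for illustration, not the dataset used above, and the "sample" is simply the first row of the data rather than a hand-written one.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Bundled dataset: 442 samples, 10 numeric features, one numeric target
X, y = load_diabetes(return_X_y=True)

model = LinearRegression().fit(X, y)

# Use the first row as a stand-in sample; predict expects a 2-D array
sample = X[:1]
pred = model.predict(sample)

print("Bias =", model.intercept_)
print("Number of coefficients =", model.coef_.shape[0])
print("Prediction for the sample =", pred)
```

As in the single-variable case, the model learns one bias plus one coefficient per feature.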

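Some Issues

1. Linear regression performs poorly when the data follows a non-linear pattern.

2. Large outliers can cause underfitting.

Polynomial Regression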

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

train_data = pd.read_csv('data1.csv')
X = train_data['Var_X'].values.reshape(-1, 1)
y = train_data['Var_Y'].values

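poly_feat = PolynomialFeatures(degree = 2)  # columns: 1, x, x^2
X_poly = poly_feat.fit_transform(X)

# The bias column is already in X_poly, so no separate intercept is fitted
poly_model = LinearRegression(fit_intercept = False).fit(X_poly, y)

y_pred = poly_model.predict(X_poly)

# Sort by X so the fitted curve is drawn left to right instead of zigzagging
order = np.argsort(X[:, 0])

plt.scatter(X, y, marker='.')
plt.plot(X[order], y_pred[order], color='blue')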
plt.show()
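Since `data1.csv` is not included here, the following sketch shows the same idea on synthetic data and compares a plain linear fit with a degree-2 polynomial fit. The data (a noisy quadratic) and the variable names are assumptions for illustration; it also uses `make_pipeline` with `include_bias=False` as one alternative to the `fit_intercept=False` approach above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic non-linear data (made up for this sketch): y = 1 + 2x - 3x^2 + noise
X = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * X[:, 0] - 3 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=100)

# Plain linear regression on the raw feature
linear = LinearRegression().fit(X, y)

# Degree-2 polynomial regression: expand x into [x, x^2], then fit linearly
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)        # R^2 of the straight-line fit
r2_quadratic = quadratic.score(X, y)  # R^2 of the polynomial fit

print("R^2 linear:   ", r2_linear)
print("R^2 quadratic:", r2_quadratic)
```

The straight line cannot follow the curvature, so its R^2 on the training data is far lower than that of the polynomial fit, which illustrates the first issue listed above.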